6  Mixed Effects Models

“Everybody pulls for David, nobody roots for Goliath.” - Wilt Chamberlain

Wilt Chamberlain scored 100 points in a game between the Philadelphia Warriors and the New York Knicks on March 2, 1962. Only about 4,000 people were in attendance and the game was not televised. Chamberlain shot 36-63 from the floor and 28-32 from the free throw line.

In many sports-related data analyses, we encounter scenarios where the outcome is influenced by both fixed effects (predictors that have a constant influence) and random effects (predictors that vary across different levels of a factor, such as players, teams, or seasons). A mixed effects model (MEM) is a statistical model that accounts for both fixed and random effects.

For example, in analyzing basketball performance, a fixed effect might be minutes played, while random effects might capture differences across individual players. Mixed effects models allow us to account for this variation, producing more accurate and generalizable results.

6.1 Fixed Effects

Fixed effects are factors that have a consistent influence across the entire dataset. These are the variables that you believe have a direct and constant effect on the outcome you are trying to predict. For example:

  • Points scored per game (PPG) in basketball might be influenced by minutes played. If a player plays more minutes, we might expect their total points to go up. This is a fixed effect because the relationship between minutes played and points scored is assumed to be the same for all players.

6.2 Random Effects

On the other hand, random effects account for the variability that can’t be easily explained by the fixed effects. These are the factors that differ across groups or levels, such as differences between individual players or teams that influence the outcome but aren’t easily predicted or measured by fixed factors. For example:

  • Player differences: Even if two players play the same number of minutes, they may score different numbers of points. These individual differences are captured by the random effects. Each player has their own specific contribution that doesn’t necessarily follow the same rule as the general population of players.

  • Team differences: Similarly, the performance of different teams can vary widely, even if they have similar stats. Random effects capture this variation. For example, some teams might perform better under pressure, while others might have unique strategies that affect game outcomes.

6.3 Mixed Effects

In a Mixed Effects Model, we combine both fixed and random effects to explain the outcome. This allows us to understand the general patterns (fixed effects) and the variations between groups (random effects).

The model equation:

\[ Y = X\beta + Z\gamma + \epsilon \]

In this equation:

  • \(Y\) is the outcome variable you want to predict (e.g., points per game, total rebounds, etc.).

  • \(X\beta\) represents the fixed effects. This part explains the predictable, overall factors that influence the outcome. In sports, this could be things like the average number of minutes a player plays or a player’s age.

  • \(Z\gamma\) represents the random effects. This part captures the differences between the groups or individuals. For example, player-specific differences, like one player being particularly good at making three-pointers, even if they play the same minutes as another player.

  • \(\epsilon\) is the residual error. This accounts for any remaining randomness in the data that the model can’t explain. It captures all the other little things that affect performance but aren’t included in the fixed or random effects.
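The components of this equation can be illustrated with a small simulation. This is a hedged sketch with made-up numbers (the player count, coefficients, and standard deviations are all assumptions for illustration), not real NBA data:

```r
# Simulate data from Y = X*beta + Z*gamma + epsilon for 5 hypothetical players.
set.seed(1)
n_players = 5          # assumed number of players
games = 20             # assumed games per player
beta = c(-1, 0.5)      # assumed fixed effects: intercept and minutes slope
gamma = rnorm(n_players, mean = 0, sd = 3)   # player random intercepts (Z*gamma)
minutes = runif(n_players * games, min = 10, max = 40)
player = rep(seq_len(n_players), each = games)
points = beta[1] + beta[2] * minutes +       # fixed part, X*beta
  gamma[player] +                            # random part, Z*gamma
  rnorm(n_players * games, sd = 5)           # residual error, epsilon
```

Fitting a mixed effects model to data generated this way should recover estimates near the assumed coefficients and standard deviations.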

6.3.1 Why Use a Mixed Effects Model?

A Mixed Effects Model is ideal when the data is hierarchical or clustered in some way, as in sports analytics. For example:

  • Data from different players within the same team.
  • Data from different teams within the same league.
  • Data from different seasons for a player or team.

Using MEMs helps us model the complexity of real-world scenarios where both overall trends (fixed effects) and individual variations (random effects) are important. By including random effects, we avoid oversimplifying the model, which could lead to inaccurate predictions or conclusions.

Imagine you’re trying to predict how many points a basketball player scores in a game. You might think that:

  • The more minutes a player plays, the more points they’ll score (this is a fixed effect).

  • But players differ in their skill levels (this is a random effect). Some players might be consistently better at scoring, regardless of the minutes they play, while others might not perform as well even if they play the same amount of time.

By using a Mixed Effects Model, we can combine these two influences to get a more accurate and personalized prediction of player performance.

6.4 The hoopR library

The hoopR package is used in R for accessing men’s basketball (both NBA and NCAA) data from ESPN and kenpom.com.

Below, we download player stats for every game during the 2022-2023 season.
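A sketch of the download step, assuming the hoopR function load_nba_player_box() and that the 2022-23 season is coded as 2023 (check the hoopR documentation for the exact season coding):

```r
library(hoopR)

# Load per-player box scores for every game of the 2022-23 NBA season.
# The columns used later (athlete_display_name, minutes, points)
# come from this data frame.
nba_data = load_nba_player_box(seasons = 2023)
```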

6.5 Fixed Effect Only

Let’s start with a simple regression fit where the points a player scores is the response and the number of minutes played is the only predictor variable, a fixed effect.

6.5.1 Setup Recipe

There are some NAs in the data, so we will need to remove them.

rec = recipe(nba_data, points ~ minutes) |> 
  step_naomit(all_numeric_predictors())

dat = rec |> 
  prep(training = nba_data) |> 
  bake(new_data=NULL)

A scatterplot of the data:

dat |> ggplot(aes(x = minutes, y = points))+
  geom_point()

6.5.2 Setup model and fit

From the scatterplot, there appears to be an issue of heteroscedasticity. We will ignore that for now and fit a simple linear regression model.

slr_model = linear_reg() |> 
  set_engine("lm") |> 
  set_mode("regression")

slr_fit = slr_model |> fit(points~minutes,
                           data = dat)

glance(slr_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df  logLik     AIC     BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
1     0.550         0.550  6.11    33839.       0     1 -89267. 178540. 178564.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

We see that the coefficient of determination is 0.5503.

6.6 Including Player as Random Effect

Let’s now take into account the effect of the player. This would be a random effect.

In R, the lmer() function from the lme4 package is used for mixed effects models. We will use "lmer" as the engine in tidymodels. Note that the multilevelmod extension package is required to fit this model.

The formulas for mixed effects models specify the fixed effects just as in linear regression: you list the predictors to the right of ~, separated by +. Random effects are wrapped in parentheses with the 1 | notation, as in (1 | group). In our example, athlete_display_name is the player’s name, so it will be the random effect.

mme_model = linear_reg() |> 
  set_engine("lmer") 

Note that the only mode available for the lmer engine is regression, so we omit set_mode().

We also need to update the recipe to include the athlete_display_name variable.

mme_rec = recipe(nba_data, points ~ minutes + athlete_display_name) |> 
  step_naomit(all_numeric_predictors())

dat = mme_rec |> 
  prep(training = nba_data) |> 
  bake(new_data=NULL)

To obtain the summary of the fit from tidy or glance, we will need to use the broom.mixed library.

library(broom.mixed)

mme_fit = mme_model |>  
  fit(points ~ minutes + (1|athlete_display_name), data = dat)

tidy(mme_fit)
# A tibble: 4 × 6
  effect   group                term            estimate std.error statistic
  <chr>    <chr>                <chr>              <dbl>     <dbl>     <dbl>
1 fixed    <NA>                 (Intercept)       -0.890   0.173       -5.14
2 fixed    <NA>                 minutes            0.505   0.00422    120.  
3 ran_pars athlete_display_name sd__(Intercept)    3.36   NA           NA   
4 ran_pars Residual             sd__Observation    5.00   NA           NA   
glance(mme_fit)
# A tibble: 1 × 7
   nobs sigma  logLik     AIC     BIC REMLcrit df.residual
  <int> <dbl>   <dbl>   <dbl>   <dbl>    <dbl>       <int>
1 27650  5.00 -84523. 169055. 169088.  169047.       27646

Note the reduction in sigma and AIC from the model without the random effect.

6.7 Interpreting the effects

When interpreting the random effects for player-specific deviations in our example:

Variance of Random Effects (Player): The variance of the random effect for each player tells us how much a player’s performance (points scored) deviates from the overall average. If the variance is large, it means that there is considerable variability between players in how they perform, even after accounting for the number of minutes they play.

Random Intercepts: In the random effect model (1|athlete_display_name), the intercepts represent how much each player’s baseline points scored differs from the average baseline. If a player’s random intercept is large (either positive or negative), it means they consistently score more or fewer points than the average player, regardless of the number of minutes they play.

For our example, the estimated standard deviation of the player random intercepts is 3.361. So a typical player’s baseline points deviate from the overall mean by about 3.36 points.

To obtain the random intercepts, we can use the ranef function from the lme4 library.

library(lme4)
random_effects = ranef(mme_fit$fit)

random_effects$athlete_display_name |>
  arrange(desc(`(Intercept)`)) |> 
  head(n=10)
                        (Intercept)
Joel Embiid                14.77061
Giannis Antetokounmpo      14.74653
Luka Doncic                14.22519
Damian Lillard             14.12978
Shai Gilgeous-Alexander    13.74390
Stephen Curry              12.28285
Jayson Tatum               11.35835
Devin Booker               11.16106
Kevin Durant               10.88659
Donovan Mitchell           10.39647
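With both effects estimated, the fitted model can generate player-specific predictions. A hedged sketch, assuming prediction through the multilevelmod engine works for new data whose player appeared in the training data (the 30-minute value is made up for illustration):

```r
# Predict points for a hypothetical 30-minute game by a player in the data.
new_game = tibble::tibble(
  minutes = 30,
  athlete_display_name = "Joel Embiid"
)
predict(mme_fit, new_data = new_game)
```

The prediction combines the fixed part (the intercept plus the minutes slope times 30) with the player’s estimated random intercept, so two players given the same minutes can receive different predicted points.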